Welcome to the 2025 Hackathon!

Overview

In this Hackathon you will estimate how valuable each building is as a candidate site for installing solar panels.

You will be given satellite images, and you will have to detect all buildings and then predict their value.

Objective

Your technical objective is two-fold:

Objective 1 : detect as many of the true buildings as possible in the test area

Objective 2 : correctly predict the values of the detected buildings in the test area

Winning criteria

Your score will be evaluated as a weighted average of the scores given by a technical panel and a business panel:

$$ \text{final score} = 60\%~\text{technical score} + 40\%~\text{business score}$$
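As a quick sanity check, the weighting above can be computed directly (both panel scores are assumed to be in $[0, 1]$):

```python
# Weighted final score, per the formula above.
def final_score(technical: float, business: float) -> float:
    return 0.60 * technical + 0.40 * business

# A perfect technical score with half the business points:
print(final_score(1.0, 0.5))  # ~0.8
```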

Technical score [max: 1 point]

The technical score relates to the performance of your code $-$ You can collect points in this way:

  • +0.3 points : You detected more than 35% of the true buildings in the test area
  • +0.3 points : Your prediction model scores a Mean Squared Error (MSE) lower than 2.0 on the test area
  • +0.2 points : You detected more true buildings than the other participants
  • +0.2 points : Your prediction model scores a better MSE than the other participants

Business score [max: 1 point]

The business score relates to the presentation of your project $-$ Your presentation shall address:

  • a brief explanation of the technical solution
  • how the analysis can be exploited in a business context

Working groups

You will work in groups (8–10 people), each group competing for the victory as a whole.

  • Each team shall comprise an equal number of participants from the beginner, intermediate, and advanced classes
  • Each team shall elect a group leader

Let's look at the data

The data is obtained from a satellite image of a section of the city of Boston:


What are these data?

  • Left $-$ The satellite image
  • Center $-$ The mask showing the ground-truth location of all the buildings
  • Right $-$ The value map, showing how important each building is for the installation of solar panels

As you can see, we segmented these images on a $5 \times 5 = 25$ grid of splits: we will call them "samples".

  • The first 24 samples will constitute your training data
  • The last sample will constitute your test data (highlighted in green)

Let's load all samples and look at a few of them:

5 examples [out of 24] of training areas/masks/maps:

Dataframe of ground truth patches: each row is a patch in any of the images
image_ID patch_ID centroid_x centroid_y patch_area bbox value
0 0 1 269.333333 146.466667 30.0 (143, 265, 151, 275) 22.932174
1 0 2 217.953488 143.604651 43.0 (139, 214, 149, 223) 24.711937
2 0 3 298.133333 126.600000 15.0 (124, 296, 131, 300) 26.183018
3 0 4 253.434126 232.372484 8744.0 (164, 197, 300, 300) 48.327791
4 1 1 3.054054 128.459459 74.0 (122, 0, 135, 9) 26.183018
... ... ... ... ... ... ... ...
1359 23 137 295.182796 294.698925 93.0 (286, 288, 300, 300) 33.782497
1360 23 138 1.982456 15.649123 57.0 (10, 0, 24, 7) 35.016672
1361 23 139 83.828341 57.513249 1736.0 (18, 43, 91, 121) 36.517404
1362 23 140 293.501425 147.059829 351.0 (129, 282, 167, 300) 38.099080
1363 23 141 129.060976 2.134146 164.0 (0, 115, 7, 149) 39.443048

1364 rows × 7 columns


Test image on which to infer the patch values:

What do you receive?

The code above has already loaded the following variables:

    X_train_areas = bundle_participants["X_train_areas"]
    X_train_msks  = bundle_participants["X_train_msks"]
    X_train_maps  = bundle_participants["X_train_maps"]
    df_train      = bundle_participants["df_train"]
    X_test_area   = bundle_participants["X_test_area"]

Train:

  • X_train_areas : list containing the 24 area images
  • X_train_msks : list containing the 24 pixel masks [0 or 1]
  • X_train_maps : list containing the 24 value maps [each building has a different value]
  • df_train : dataframe containing the true building values for each building patch of pixels

    Column       Description
    image_ID     Index of the sample image from which the patch was extracted (0–23).
    patch_ID     Identifier of the building patch within that image.
    centroid_x   X-coordinate of the patch centroid.
    centroid_y   Y-coordinate of the patch centroid.
    patch_area   Number of pixels belonging to the patch.
    bbox         Bounding box of the patch in the format (ymin, xmin, ymax, xmax).
    value        Value associated with the building.

Test:

  • X_test_area : test area image

Tasks

You will have to complete the following tasks.

Task 1 $-$ Building segmentation

[ intermediate/advanced participants ]

Use computer vision to detect and characterize all the buildings.

  1. Use SAM (Segment Anything Model) to find all the buildings in the images
      (You may post process the segmentation to remove false detections if needed)

  2. For each patch, associate a corresponding value from the training value maps
      (e.g., you can use the value of the closest true building)

  3. Create a dataframe df_segmentations with the same columns as df_train, but filled with the values you obtained in steps 1 and 2.

image_ID patch_ID centroid_x centroid_y patch_area bbox value
… … … … … (…, …, …, …) …
… … … … … (…, …, …, …) …

Important: This dataframe shall also contain the data for the test image $-$ Of course, you shall set its value column to 0, since the true values are unknown for the test.
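The patch-characterization part of steps 1 and 3 could be sketched as follows. The SAM call itself is omitted; `binary_mask` is a toy stand-in for the union of its output masks, and `mask_to_patches` is a hypothetical helper name, not provided code:

```python
# Sketch: turn a binary segmentation mask for one image into
# df_segmentations-style rows (value left at 0, to be filled in step 2).
import numpy as np
import pandas as pd
from scipy import ndimage

def mask_to_patches(binary_mask: np.ndarray, image_id: int) -> pd.DataFrame:
    labeled, n = ndimage.label(binary_mask)  # one label per connected patch
    rows = []
    for patch_id in range(1, n + 1):
        ys, xs = np.nonzero(labeled == patch_id)
        rows.append({
            "image_ID": image_id,
            "patch_ID": patch_id,
            "centroid_x": xs.mean(),
            "centroid_y": ys.mean(),
            "patch_area": float(len(ys)),
            # (ymin, xmin, ymax, xmax), matching the df_train bbox format
            "bbox": (ys.min(), xs.min(), ys.max(), xs.max()),
            "value": 0.0,
        })
    return pd.DataFrame(rows)

binary_mask = np.zeros((10, 10), dtype=np.uint8)
binary_mask[1:4, 1:4] = 1    # toy "building" #1 (3x3 pixels)
binary_mask[6:9, 5:9] = 1    # toy "building" #2 (3x4 pixels)
df = mask_to_patches(binary_mask, image_id=0)
print(df[["patch_ID", "patch_area", "bbox"]])
```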

Task 2 $-$ Regression

[ beginner/intermediate participants ]

Train a regression model on df_segmentations that uses any subset of the columns to predict the value column.

Hint: You will receive df_segmentations from Task 1, but, until your teammates are done, you can use df_train, since it has exactly the same structure.
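A minimal sketch of this task, with synthetic data standing in for the real df_train / df_segmentations (the choice of RandomForestRegressor is just one option, not a requirement):

```python
# Fit a regressor on a df_train-like frame and report training MSE.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "centroid_x": rng.uniform(0, 300, 200),
    "centroid_y": rng.uniform(0, 300, 200),
    "patch_area": rng.uniform(5, 9000, 200),
})
# Synthetic target loosely tied to patch_area, just for illustration.
df["value"] = 20 + 0.003 * df["patch_area"] + rng.normal(0, 1, 200)

features = ["centroid_x", "centroid_y", "patch_area"]  # any subset of the columns
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(df[features], df["value"])
print(mean_squared_error(df["value"], model.predict(df[features])))
```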


Evaluation

We will evaluate as follows:

Task 1: We will check how many bounding boxes match the ground truth, for the test image.

Task 2: We will apply your trained model to the dataframe, and evaluate the MSE.
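The exact matching rule is not specified here, but a common criterion for comparing detected and ground-truth boxes is intersection over union (IoU), which you could use to self-evaluate on the training samples:

```python
# IoU between two boxes in the (ymin, xmin, ymax, xmax) format
# of the bbox column.
def iou(a, b):
    ymin, xmin = max(a[0], b[0]), max(a[1], b[1])
    ymax, xmax = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ymax - ymin) * max(0, xmax - xmin)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```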

Expected outputs

For this evaluation to happen, it is fundamental that you pass us your results in the correct format.

💾 image

  • An image ($\texttt{.png}$ or $\texttt{.jpg}$) showing the overlay or mask of all segmentations detected on the test image.

💾 dataset

  • The dataset df_segmentations shall be sent to us as a $\texttt{.csv}$ file named "$\texttt{df\_segmentations.csv}$"
  • It shall have exactly and only the columns mentioned above

💾 trained model

  • The trained model shall be sent to us as a joblib file named "$\texttt{model.pkl}$"

    import joblib
    joblib.dump(model, "model.pkl")

  • You shall also tell us which features of df_segmentations you used, e.g.:

    ["patch_ID", "centroid_x", "patch_area"]

    We will extract the features from a test dataframe with the same columns as df_segmentations $\rightarrow$ you need to make sure that we can just run:

    X_test = df_test[["patch_ID", "centroid_x", "patch_area"]]
    model.predict(X_test)
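Before submitting, it is worth verifying the round trip yourself: dump the model, reload it, and confirm predict runs on a frame with the df_segmentations columns. The LinearRegression model and the tiny frame below are illustrative stand-ins:

```python
# Self-check: the dumped model must reload and predict on the chosen features.
import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["patch_ID", "centroid_x", "patch_area"]
df = pd.DataFrame({"patch_ID": [1, 2], "centroid_x": [10.0, 20.0],
                   "patch_area": [30.0, 40.0], "value": [1.0, 2.0]})
model = LinearRegression().fit(df[features], df["value"])
joblib.dump(model, "model.pkl")

reloaded = joblib.load("model.pkl")
print(reloaded.predict(df[features]))  # must run without errors
```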


⚠️ Warning ⚠️

We will not be able to debug your code! $\rightarrow$ Make sure you comply with these formats!